15.1 Poisson and negative binomial regression:

The folder RiskyBehavior contains data from a randomized trial targeting couples at high risk of HIV infection. The intervention provided counseling sessions regarding practices that could reduce their likelihood of contracting HIV. Couples were randomized either to a control group, a group in which just the woman participated, or a group in which both members of the couple participated. One of the outcomes examined after three months was “number of unprotected sex acts.”

a)

Model this outcome as a function of treatment assignment using a Poisson regression. Does the model fit well? Is there evidence of overdispersion?

risk <- read.csv("https://raw.githubusercontent.com/avehtari/ROS-Examples/master/RiskyBehavior/data/risky.csv",header=T)
risk$fupacts_R = round(risk$fupacts)

risk$women_alone <- as.factor(risk$women_alone)
risk$couples <- as.factor(risk$couples)
risk$sex <- as.factor(risk$sex)
risk$bs_hiv <- as.factor(risk$bs_hiv)

To summarize:

  • sex is the sex of the person, recorded as “man” or “woman” here
  • couples is an indicator for if the couple was counseled together
  • women_alone is an indicator for if the woman went to counseling by herself
  • bs_hiv indicates if the individual is HIV positive
  • bupacts is the number of unprotected sex acts reported as a baseline (before treatment)
  • fupacts is the number of unprotected sex acts reported at the end of the study
# Fit the model

fit_15.1a <- stan_glm (fupacts_R ~ couples + women_alone, family = poisson (link = 'log'), data = risk, refresh = 0)

# Display the model
summary
## standardGeneric for "summary" defined from package "base"
## 
## function (object, ...) 
## standardGeneric("summary")
## <environment: 0x7f88d66b4588>
## Methods may be defined for arguments: object
## Use  showMethods(summary)  for currently available ones.
# Posterior predictive check

pp_check(fit_15.1a)

# residual vs predicted plot (can also be done with ggplot)

yhat=predict(fit_15.1a, type="response")
plot(yhat, resid(fit_15.1a), xlab="predicted", ylab="residual",
     main="Residuals vs Predicted values", pch=20)
abline(0, 0, col="gray", lwd=.5)

# Use a rootogram
rootogram(fit_15.1a)

# Can also use a dispersiontest() to check goodness of fit or a 

The model is not fitting the data well as we can see from the posterior predictive check and the rootogram. There is evidence of overdispersion as the residuals are very large.

b)

Next extend the model to include pre-treatment measures of the outcome and the additional pre-treatment variables included in the dataset. Does the model fit well? Is there evidence of overdispersion?

# Fit the model
fit_15.1b <- stan_glm (fupacts_R ~ couples + women_alone + bs_hiv + bupacts + sex, family = poisson (link = 'log'), data = risk, refresh = 0)

# Posterior predictive check
pp_check(fit_15.1b)

# Residual vs predicted plot
yhat=predict(fit_15.1b, type="response")
plot(yhat, resid(fit_15.1b), xlab="predicted", ylab="residual",
     main="Residuals vs Predicted values", pch=20)

# Use a rootogram
rootogram(fit_15.1b)

Similar to the previous fit, the model is not fitting the data well as we can see from the posterior predictive check and rootogram. There is still evidence of overdispersion as the residuals are very spread out.

c)

Fit a negative binomial (overdispersed Poisson) model. What do you conclude regarding effectiveness of the intervention?

# Fit the model

fit_15.1c <- glm.nb (fupacts_R ~ couples + women_alone + bs_hiv + bupacts + sex, link = 'log', data = risk)

# Residual vs predicted plot
pred = predict(fit_15.1c, type="response")
resid = resid(fit_15.1c)
plot(pred, resid, xlab="predicted", ylab="residual",
     main="Residuals vs Predicted values", pch=20)

# Use a rootogram
rootogram(fit_15.1c)

The rootogram shows us that the model clearly fits better than before, the residuals are smaller but the plot is strange looking and does not offer much info.

d)

These data include responses from both men and women from the participating couples. Does this give you any concern with regard to our modeling assumptions?

The data might have some collineraity and there could be correlated errors. this would be because both of the people in one couple would have the same responses.

15.3 Binomial regression:

Redo the basketball shooting example on page 270, making some changes:

(a)

Instead of having each player shoot 20 times, let the number of shots per player vary, drawn from the uniform distribution between 10 and 30.

N <- 100
   height <- rnorm(N, 72, 3)
   p <- 0.4 + 0.1*(height - 72)/3
   n <- round(runif(100, 10, 30),0) # round so we have a whole number for n
   y <- rbinom(N, n, p)
   data <- data.frame(n=n, y=y, height=height)

# Fit the model
fit_15.3a <- stan_glm(cbind(y, n-y) ~ height, family=binomial(link="logit"), data = data, refresh = 0)

# Display the model
fit_15.3a
## stan_glm
##  family:       binomial [logit]
##  formula:      cbind(y, n - y) ~ height
##  observations: 100
##  predictors:   2
## ------
##             Median MAD_SD
## (Intercept) -10.5    1.3 
## height        0.1    0.0 
## 
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg

(b)

Instead of having the true probability of success be linear, have the true probability be a logistic function, set so that Pr(success) = 0.3 for a player who is 5’9" and 0.4 for a 6’ tall player.

N <- 100
   height <- rnorm(N, 72, 3)
   p <- invlogit(-0.4 + 0.4*((height - 72)/3))
   n <- round(runif(100, 10, 30), 0)
   y <- rbinom(N, n, p)
   data <- data.frame(n=n, y=y, height=height)

# Fit the model
fit_15.3b <- stan_glm(cbind(y, n-y) ~ height, family=binomial(link="logit"), data=data, refresh=0)

# Display the model
fit_15.3b
## stan_glm
##  family:       binomial [logit]
##  formula:      cbind(y, n - y) ~ height
##  observations: 100
##  predictors:   2
## ------
##             Median MAD_SD
## (Intercept) -10.4    1.2 
## height        0.1    0.0 
## 
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg

15.7 Tobit model for mixed discrete/continuous data:

Experimental data from the National Supported Work example are in the folder Lalonde. Use the treatment indicator and pre-treatment variables to predict post-treatment (1978) earnings using a Tobit model. Interpret the model coefficients.

lalonde = foreign::read.dta("https://github.com/avehtari/ROS-Examples/blob/master/Lalonde/NSW_dw_obs.dta?raw=true")

# Fit the model
fit_15.7 <- vglm(re78 ~ treat + age + married + sample + educ_cat4 + educ + black, tobit(), data=lalonde)

# Display the model
summary(fit_15.7)
## 
## Call:
## vglm(formula = re78 ~ treat + age + married + sample + educ_cat4 + 
##     educ + black, family = tobit(), data = lalonde)
## 
## Coefficients: 
##                 Estimate Std. Error  z value Pr(>|z|)    
## (Intercept):1 -1.074e+04  7.920e+02  -13.564  < 2e-16 ***
## (Intercept):2  9.344e+00  5.635e-03 1658.095  < 2e-16 ***
## treat          3.513e+03  9.774e+02    3.594 0.000325 ***
## age            5.480e+01  8.649e+00    6.337 2.34e-10 ***
## married        4.812e+03  2.132e+02   22.574  < 2e-16 ***
## sample         6.567e+03  2.549e+02   25.761  < 2e-16 ***
## educ_cat4      7.102e+02  1.889e+02    3.759 0.000170 ***
## educ           4.287e+02  6.716e+01    6.383 1.74e-10 ***
## black         -3.165e+03  2.962e+02  -10.688  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Names of linear predictors: mu, loglink(sd)
## 
## Log-likelihood: -176913.7 on 37325 degrees of freedom
## 
## Number of Fisher scoring iterations: 5 
## 
## Warning: Hauck-Donner effect detected in the following estimate(s):
## '(Intercept):2'

The interpretation would be similar to a linear model. Most of the predictors have a positive result on post treatment earnings. Race could play a negative role if the person is black. the intercept is the constant for the model and the intercept 2 is the ancillary statistic.

15.8 Robust linear regression using the t model:

The folder Congress has the votes for the Democratic and Republican candidates in each U.S. congressional district in 1988, along with the parties’ vote proportions in 1986 and an indicator for whether the incumbent was running for reelection in 1988. For your analysis, just use the elections that were contested by both parties in both years.

congress = read.csv("https://raw.githubusercontent.com/avehtari/ROS-Examples/master/Congress/data/congress.csv")
congress88 <- data.frame(vote=congress$v88_adj,pastvote=congress$v86_adj,inc=congress$inc88)

(a)

Fit a linear regression using stan_glm with the usual normal-distribution model for the errors predicting 1988 Democratic vote share from the other variables and assess model fit.

# Fit the model
fit_15.8a <- stan_glm(vote ~ pastvote + inc, data = congress88, refresh = 0)

# Predictive posterior check
pp_check(fit_15.8a)

# Residual vs predicted plot
pred = predict(fit_15.8a, type="response")
resid = resid(fit_15.8a)
plot(pred, resid, xlab="predicted", ylab="residual",
     main="Residuals vs Predicted values", pch=20)

The fit of the model is not bad as we can see from the figures. There does not seem to be any overdispersion detected from the residual plot as the residuals are evenly dispersed in a bimodal pattern.

(b)

Fit the same sort of model using the brms package with a t distribution, using the brm function with the student family. Again assess model fit.

# Fit the model
fit_15.8b <- brm(vote ~ pastvote + inc, data = congress88, refresh = 0)
## Compiling Stan program...
## Trying to compile a simple C file
## Running /Library/Frameworks/R.framework/Resources/bin/R CMD SHLIB foo.c
## clang -mmacosx-version-min=10.13 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG   -I"/Library/Frameworks/R.framework/Versions/4.1/Resources/library/Rcpp/include/"  -I"/Library/Frameworks/R.framework/Versions/4.1/Resources/library/RcppEigen/include/"  -I"/Library/Frameworks/R.framework/Versions/4.1/Resources/library/RcppEigen/include/unsupported"  -I"/Library/Frameworks/R.framework/Versions/4.1/Resources/library/BH/include" -I"/Library/Frameworks/R.framework/Versions/4.1/Resources/library/StanHeaders/include/src/"  -I"/Library/Frameworks/R.framework/Versions/4.1/Resources/library/StanHeaders/include/"  -I"/Library/Frameworks/R.framework/Versions/4.1/Resources/library/RcppParallel/include/"  -I"/Library/Frameworks/R.framework/Versions/4.1/Resources/library/rstan/include" -DEIGEN_NO_DEBUG  -DBOOST_DISABLE_ASSERTS  -DBOOST_PENDING_INTEGER_LOG2_HPP  -DSTAN_THREADS  -DBOOST_NO_AUTO_PTR  -include '/Library/Frameworks/R.framework/Versions/4.1/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp'  -D_REENTRANT -DRCPP_PARALLEL_USE_TBB=1   -I/usr/local/include   -fPIC  -Wall -g -O2  -c foo.c -o foo.o
## In file included from <built-in>:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.1/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13:
## In file included from /Library/Frameworks/R.framework/Versions/4.1/Resources/library/RcppEigen/include/Eigen/Dense:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.1/Resources/library/RcppEigen/include/Eigen/Core:88:
## /Library/Frameworks/R.framework/Versions/4.1/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:628:1: error: unknown type name 'namespace'
## namespace Eigen {
## ^
## /Library/Frameworks/R.framework/Versions/4.1/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:628:16: error: expected ';' after top level declarator
## namespace Eigen {
##                ^
##                ;
## In file included from <built-in>:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.1/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13:
## In file included from /Library/Frameworks/R.framework/Versions/4.1/Resources/library/RcppEigen/include/Eigen/Dense:1:
## /Library/Frameworks/R.framework/Versions/4.1/Resources/library/RcppEigen/include/Eigen/Core:96:10: fatal error: 'complex' file not found
## #include <complex>
##          ^~~~~~~~~
## 3 errors generated.
## make: *** [foo.o] Error 1
## Start sampling
# Predictive posterior check
pp_check(fit_15.8b)
## Using 10 posterior draws for ppc type 'dens_overlay' by default.

# Residual vs predicted plot
pred = predict(fit_15.8b, type="response")
resid = resid(fit_15.8b)
plot(pred, resid, xlab="predicted", ylab="residual",
     main="Residuals vs Predicted values", pch=20)

The fit of the model looks similar to the one in our previous model but the residuals are not showing two separate clusters so the model seems to fit the data better.

(c)

Which model do you prefer?

I prefer the robust regression model since it seems to be a better fit.

15.9 Robust regression for binary data using the robit model:

Use the same data as the previous example with the goal instead of predicting for each district whether it was won by the Democratic or Republican candidate.

(a)

Fit a standard logistic or probit regression and assess model fit.

# Fit the model
congress88$win <- ifelse(congress88$vote>0.5,1,0)
fit_15.9a <- stan_glm(win ~ pastvote + inc, family = binomial(link = "logit"), data = congress88, refresh = 0)

# Predictive posterior check
pp_check(fit_15.9a)

# Binned Residual plot
fitted = fitted(fit_15.9a)
resid = resid(fit_15.9a)
binnedplot(fitted, resid, xlab="fitted", ylab="residual",
     main="Binned Residual Plot")